Abstract

Mutual exclusivity among cancer gene mutations is increasingly observed and is a key signal
for identifying driver events or pathways in cancer genome analysis. Here we report a rigorous
statistical method to compute an exact p-value for beyond-pairwise mutual exclusivity or
co-occurrence relationships among cancer gene mutations by enumerating a null distribution of
overlapping mutations across more than two genes. The validity and the advantage of our method
are explicitly demonstrated in both cancer gene mutation data and simulation data through
comparison with the permutation test.
I. INTRODUCTION


The recent development of high-throughput genome sequencing technologies has revealed a
remarkable complexity in the genetic and epigenetic aberrations characteristic of cancer
initiation and progression. Given the genetic heterogeneity across samples, even within a single
cancer type, one of the challenges is to distinguish underlying driver events from a myriad of
random passenger events. The standard approach for discovering cancer driver genes is to
identify genes with a significantly higher mutation rate across samples than the background
mutation rate. In addition to mutational recurrence across samples, recent cancer gene
discovery is often corroborated by evaluating the significance of individual mutations or
genes in the context of mutation patterns projected onto cellular signaling and regulatory
pathways [6]. Mutual exclusivity is commonly observed among driver mutations on the same
pathway and plays an important role in discovering cancer driver events or pathways in many
network and pathway analysis algorithms.

Fisher's exact test, or equivalently the hypergeometric test, provides an exact p-value for
pairwise mutual exclusivity between two genes by directly enumerating the more extreme cases,
i.e. those with a number of overlapping mutations equal to or less than the observed one.
However, it is non-trivial to extend the pairwise exact test to a gene set consisting of more
than two genes. Considering the illuminating role of mutation patterns in the discovery of
cancer driver genes or pathways, a systematic method to evaluate the statistical significance
of beyond-pairwise correlations or anti-correlations among mutations is highly desirable. In
this paper, we report a rigorous statistical method to determine an exact p-value for
beyond-pairwise mutual exclusivity or co-occurrence relationships among cancer mutations in an
arbitrary number of genes. Our study illustrates that the null distribution of overlapping
mutations across genes can be determined via sequential hypergeometric sampling, and that the
exact p-value can be computed through a simple recursive formula for given mutation data. We
validated our method on both cancer gene mutation data and simulation data through comparison
with the permutation test.
II. METHODS


Let us start by briefly reviewing pairwise mutual exclusivity, in which two genes $g_1$ and
$g_2$ have $n_1$ and $n_2$ mutations, respectively, across $n$ samples. When the $n_1$ mutations
in $g_1$ are assigned to samples chosen at random from the $n$ samples, and the $n_2$ mutations
in $g_2$ are assigned to another, independent random choice of samples, the probability that
the two genes have $z$ overlapping mutations exactly follows a hypergeometric distribution for
$z \in [\max(0, n_1 + n_2 - n), \min(n_1, n_2)]$ as

$$P_2(z) = H(z; n_1, n - n_1, n_2) = \frac{\binom{n_1}{z}\binom{n - n_1}{n_2 - z}}{\binom{n}{n_2}}, \qquad (1)$$

$\binom{a}{b}$ being a binomial coefficient. The distribution $P_2(z)$ describes hypergeometric
sampling, characterized by the probability of choosing $z$ white balls in $n_2$ draws without
replacement from a finite population of $n_1$ white balls and $(n - n_1)$ black balls. The
p-value for pairwise exclusivity is determined by summing the probabilities of all cases at
least as extreme as the observation, $z \le z_0$, $z_0$ being the observed number of overlapping
mutations, as $\sum_{z=z^*}^{z_0} P_2(z)$, $z^* = \max(0, n_1 + n_2 - n)$. The hypergeometric
test with Eq. (1) is identical to the one-tailed version of Fisher's exact test.
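
The pairwise test of Eq. (1) can be sketched in a few lines of Python. This is a minimal
illustration, not the authors' code: the function names `hypergeom_pmf` and
`pairwise_exclusivity_pvalue` are hypothetical, and the numbers in the usage line are toy values
chosen only for demonstration.

```python
from math import comb


def hypergeom_pmf(z, white, black, draws):
    """H(z; white, black, draws): probability of drawing exactly z white balls
    in `draws` draws without replacement from `white` + `black` balls."""
    if z < 0 or z > white or draws - z < 0 or draws - z > black:
        return 0.0  # outside the support of the distribution
    return comb(white, z) * comb(black, draws - z) / comb(white + black, draws)


def pairwise_exclusivity_pvalue(n, n1, n2, z0):
    """One-tailed p-value: sum of P_2(z) over z* <= z <= z0."""
    z_min = max(0, n1 + n2 - n)
    return sum(hypergeom_pmf(z, n1, n - n1, n2) for z in range(z_min, z0 + 1))


# Toy usage (assumed numbers): 10 samples, 3 mutations in each gene, no overlap observed.
print(pairwise_exclusivity_pvalue(n=10, n1=3, n2=3, z0=0))
```

Summing over the lower tail, including the observed value $z_0$, mirrors the one-tailed Fisher's
exact test described above.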

The mutation data for $m$ genes with $n_i$ mutations each ($i = 1, \ldots, m$) across $n$ samples
can be represented by an $n \times m$ binary matrix $A$, with $A_{ij} = 1$ if gene $g_j$ is
mutated in sample $i$, and 0 otherwise. Here we quantify the overlapping mutations by
$z = r(A) - \omega(A)$, with $r(A) = \sum_{ij} A_{ij}$ and
$\omega(A) = \sum_i \max(A_{i1}, \ldots, A_{im})$. Note that $\omega(A)$ represents the number
of samples having at least one mutation across the genes, and $z$ ($\ge 0$) becomes zero only if
the mutations among the genes are perfectly exclusive.
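
As a small illustration of this overlap measure, the sketch below computes the total number of
mutations, the number of samples with at least one mutation, and their difference for a binary
matrix stored as a list of rows. The function name `overlap_count` and the toy matrix are
assumptions made for this example only.

```python
def overlap_count(A):
    """Overlapping mutations z = (total mutations) - (samples with >= 1 mutation).

    A is a list of rows (samples); each row holds 0/1 entries, one per gene.
    """
    total = sum(sum(row) for row in A)    # all mutations, summed over samples and genes
    covered = sum(max(row) for row in A)  # samples carrying at least one mutation
    return total - covered


# Toy matrix with 5 samples (rows) and 3 genes (columns).
A = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],  # two genes mutated in the same sample: contributes 1 overlap
    [1, 1, 1],  # three genes mutated in the same sample: contributes 2 overlaps
    [0, 0, 0],
]
print(overlap_count(A))  # 3
```

A sample mutated in $k$ genes contributes $k - 1$ to $z$, so a perfectly exclusive pattern gives
zero.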

Our goal is to construct a random null distribution characterized by the probability that
a set of $m$ genes ($m \ge 3$) has $z$ overlapping mutations. For this purpose we first consider
the simplest case of $m = 3$. Without loss of generality we assume $n_1 \ge n_2 \ge n_3$ by
re-ordering the genes. Since the probability of $x$ overlapping mutations between $g_1$ and
$g_2$ is already determined as $H(x; n_1, n - n_1, n_2)$, the remaining step is to enumerate the
overlapping mutations newly formed by the addition of $g_3$ for a fixed overlap $x$ between
$g_1$ and $g_2$. The key finding is that the probability that the three genes have $y$ extra
overlapping mutations in addition to $x$ is also determined by hypergeometric sampling, yielding
$H(y; p, q, n_3)$ for $y \in [\max(0, p + n_3 - n), \min(p, n_3)]$,
$p = \min(n_1 + n_2 - x, n)$ and $q = n - p$. In analogy to pairwise mutual exclusivity, this
distribution describes the likelihood of choosing $y$ white balls in $n_3$ draws without
replacement from a finite population of size $n$ comprised of $p$ white balls and $q$ black
balls. Note that the number of white balls is now $(n_1 + n_2 - x)$, the number of samples
mutated in $g_1$ or $g_2$, because of the $x$ existing overlapping mutations between $g_1$ and
$g_2$.

FIG. 1: Mutations in KRAS (60), STK11 (34), and EGFR (30) in the lung adenocarcinoma project.

Summing the probabilities over all possible choices of $x$ and $y$, the final null distribution
of observing $z$ overlapping mutations is obtained as

$$P_3(z) = \sum_{x=x^*}^{n_2} \sum_{y=y^*}^{n_3} P_2(x)\, H(y; p, q, n_3)\, \delta[z - (x + y)], \qquad (2)$$

where $\delta[\cdot]$ is the Kronecker delta, $x^* = \max(0, n_1 + n_2 - n)$,
$y^* = \max(0, p + n_3 - n)$, $p = \min(n_1 + n_2 - x, n)$, and $q = n - p$. For an arbitrary
$m$ ($\ge 3$), $P_m(z)$ is determined iteratively using the following recursive relation,

$$P_m(z) = \sum_{x=x^*}^{n^*_{m-1}} \sum_{y=y^*}^{n_m} P_{m-1}(x)\, H(y; p, q, n_m)\, \delta[z - (x + y)], \qquad (3)$$

where $x^* = \max(0, \sum_{i=1}^{m-1} n_i - n)$, $y^* = \max(0, p + n_m - n)$,
$p = \min(\sum_{i=1}^{m-1} n_i - x, n)$, and $q = n - p$. Here
$n^*_{m-1} = \sum_{i=2}^{m-1} n_i$ denotes the maximum number of overlapping mutations in the
gene set consisting of $g_1, \ldots, g_{m-1}$.
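
The recursion of Eq. (3) can be sketched as follows. This is a minimal illustration under stated
assumptions, not the authors' implementation: the function names are hypothetical, the recursion
is started from a single gene with zero overlap (so that the first step reproduces the pairwise
hypergeometric distribution), and the number of white balls $p$ is taken to be the number of
samples already mutated, i.e. the sum of $n_i$ over the genes added so far minus $x$.

```python
from math import comb


def hypergeom_pmf(y, white, black, draws):
    """H(y; white, black, draws) for sampling without replacement."""
    if y < 0 or y > white or draws - y < 0 or draws - y > black:
        return 0.0
    return comb(white, y) * comb(black, draws - y) / comb(white + black, draws)


def exact_null_distribution(n, counts):
    """Return {z: P_m(z)} for genes with the given mutation counts across n samples."""
    counts = sorted(counts, reverse=True)  # n_1 >= n_2 >= ... as in the text
    dist = {0: 1.0}                        # one gene alone: zero overlap (assumed start)
    mutations_so_far = counts[0]           # sum of n_i over the genes already added
    for n_m in counts[1:]:
        new_dist = {}
        for x, prob_x in dist.items():
            p = mutations_so_far - x       # samples already mutated (white balls)
            q = n - p                      # samples not yet mutated (black balls)
            for y in range(max(0, p + n_m - n), min(p, n_m) + 1):
                new_dist[x + y] = new_dist.get(x + y, 0.0) + prob_x * hypergeom_pmf(y, p, q, n_m)
        dist = new_dist
        mutations_so_far += n_m
    return dist


def exclusivity_pvalue(n, counts, z0):
    """Sum of P_m(z) over z <= z0 (mutual exclusivity tail)."""
    return sum(prob for z, prob in exact_null_distribution(n, counts).items() if z <= z0)


def cooccurrence_pvalue(n, counts, z0):
    """Sum of P_m(z) over z >= z0 (co-occurrence tail)."""
    return sum(prob for z, prob in exact_null_distribution(n, counts).items() if z >= z0)
```

A call such as `exclusivity_pvalue(163, [60, 34, 30], 14)` would evaluate the exclusivity tail
for three genes with these counts. Storing the distribution as a dictionary keeps only the
reachable values of $z$, so each step costs at most (number of current $z$ values) times $n_m$
hypergeometric evaluations.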

III. RESULTS 

We validated our method on the mutation data of three driver genes, KRAS, STK11, and EGFR,
in the lung adenocarcinoma sequencing project. This gene set was identified as the most
exclusive one for $m = 3$ by DENDRIX analysis. The mutation data across 163 samples are
represented in Fig. 1, resulting in $n_1 = 60$, $n_2 = 34$, $n_3 = 30$, and $z_0 = 14$. In
Fig. 2(a), the exact null distribution $P_3(z)$ from Eq. (2) is compared with the approximate
$P_3(z, N_p)$ determined from the permutation test with varying $N_p$, $N_p$ being the number of
permutations. Note that $P_3(z, N_p)$ is non-zero only in a narrow dynamic region corresponding
to $[z_l, z_u]$, with $P_3(z_{l,u}) \sim 1/N_p$, while $P_3(z)$ spans the full dynamic range of
$z \in [\max(0, \sum_{i=1}^{m} n_i - n), \max(0, n^*_m)] = [0, 64]$. The p-value from the
permutation test in Fig. 2(b) asymptotically converges to the exact
$p = 1.673 \times 10^{-[\ldots]}$ with increasing $N_p$, but the p-values are zero for small
$N_p$ since the exact p-value is less than $1/N_p$. We also tested our method in a more
complicated case with $n = 500$ and $m = 5$ using simulation data (see Supplementary Figure and
text for details). Finally, we would like to stress that our method is equally applicable to the
significance test of co-occurring mutations by summing the probabilities of the more extreme
cases with $z \ge z_0$, yielding p-value $= \sum_{z \ge z_0} P_m(z)$.

FIG. 2: (a) The exact null distribution Log[$P_3(z)$] of Eq. (2) and the approximate
Log[$P_3(z, N_p)$] from the permutation test with varying $N_p$ as a function of overlapping
mutations $z$ for the gene set KRAS, STK11, and EGFR in the lung adenocarcinoma project, and
(b) p-values as a function of $N_p$. The vertical and horizontal dashed lines correspond to
$z_0$ and the exact p-value, respectively.
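
For comparison, a permutation test of the kind discussed above can be sketched as below. The
text does not spell out the permutation scheme, so this sketch assumes that each gene's
mutations are reassigned independently to uniformly random samples; the function names and the
seed handling are likewise illustrative choices.

```python
import random


def overlap_from_sets(gene_samples):
    """Overlap z = (total mutations) - (samples with at least one mutation)."""
    total = sum(len(samples) for samples in gene_samples)
    covered = len(set().union(*gene_samples))
    return total - covered


def permutation_pvalue(n, counts, z0, n_perm, seed=0):
    """Estimate the exclusivity p-value as the fraction of random reassignments
    with overlap <= z0. The estimate is always a multiple of 1 / n_perm."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Assumed null model: each gene's mutations land on random distinct samples.
        shuffled = [set(rng.sample(range(n), k)) for k in counts]
        if overlap_from_sets(shuffled) <= z0:
            hits += 1
    return hits / n_perm
```

Because the estimate is a count divided by $N_p$, any exact p-value below $1/N_p$ will typically
be reported as zero, which is the resolution limit described in the text.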


IV. SUMMARY 

In summary, we developed a rigorous statistical method to evaluate the significance of
beyond-pairwise mutual correlations among mutation data by establishing an exact null
distribution of overlapping mutations across genes. In contrast to the permutation test, which
is approximate and becomes computationally demanding with increasing $m$, our method enables an
instant calculation of exact p-values regardless of the size of the gene set of interest, making
it well suited to an extensive search for, or prioritization of, driver events or pathways in
heterogeneous mutation data in combination with other network or pathway analysis algorithms.
Supplementary Material for "An exact method to compute a p-value for the
beyond-pairwise correlations among cancer gene mutations"

Simulation Data 


To validate our method in a more complicated situation, we created the following simulated
mutation data with $n = 500$ and $m = 5$, characterized by

$$A_{ij} = \begin{cases}
1 & \text{if } i \in [1, 200] \text{ for } j = 1, \\
1 & \text{if } i \in [101, 250] \text{ for } j = 2, \\
1 & \text{if } i \in [201, 300] \text{ for } j = 3, \\
1 & \text{if } i \in [351, 450] \text{ for } j = 4, \\
1 & \text{if } i \in [431, 480] \text{ for } j = 5, \\
0 & \text{otherwise,}
\end{cases} \qquad (4)$$

resulting in $n_1 = 200$, $n_2 = 150$, $n_3 = 100$, $n_4 = 100$, $n_5 = 50$, and $z_0 = 170$. In
Eq. (4), the non-zero elements of $A_{ij}$ for each $j$ could be assigned to any samples under
the constraints of $n_i$ and the number of overlapping mutations in each pair of genes; the
final null distribution and the corresponding p-value are not affected by those permutations.
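
The simulated matrix of Eq. (4) and the quoted summary statistics can be reproduced with a short
sketch. The helper name `build_simulated_matrix` is hypothetical; the sample ranges are those
listed in Eq. (4), with samples numbered from 1.

```python
def build_simulated_matrix(n=500):
    """Binary n x 5 matrix: gene j is mutated in the samples of its range (inclusive)."""
    ranges = [(1, 200), (101, 250), (201, 300), (351, 450), (431, 480)]
    return [[1 if lo <= i <= hi else 0 for (lo, hi) in ranges] for i in range(1, n + 1)]


A = build_simulated_matrix()
counts = [sum(row[j] for row in A) for j in range(5)]          # mutations per gene
z0 = sum(sum(row) for row in A) - sum(max(row) for row in A)   # observed overlap

print(counts)  # [200, 150, 100, 100, 50]
print(z0)      # 170
```

Here 600 mutations fall on 430 distinct samples (samples 1 to 300 and 351 to 480), which gives
the overlap of 170.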

The exact null distribution $P_5(z)$ was compared in Fig. 3(a) with the approximate ones
$P_5(z, N_p)$ at various $N_p$, indicating that more than $10^{[\ldots]}$ random permutations
are needed to provide a fair comparison around $z \approx \bar{z}$, $\bar{z}$ being the mean
number of overlapping mutations, as verified by the relative errors in Fig. 3(b). However, as
clearly seen in Fig. 3(b), $P_5(z, N_p)$ provides approximate probabilities only in a narrow
dynamic region of $z \in [z_l, z_u]$, with $P_5(z_{l,u}) \sim 1/N_p$, in comparison to $P_5(z)$
spanning the full dynamic range of $z \in [100, 400]$. As expected, the permutation test failed
to provide any significance measure for the exclusivity, since the correct p-value of
$8.134 \times 10^{-[\ldots]}$ is far less than the minimum resolution $1/N_p$ of the permutation
test.
FIG. 3: (a) The exact null distribution $P_5(z)$ and approximate distributions $P_5(z, N_p)$
from the permutation test with varying $N_p$ from $10^{[\ldots]}$ to $10^{[\ldots]}$ as a
function of overlapping mutations $z$ for the simulation data in Eq. (4) with $n = 500$ and
$m = 5$, (b) relative errors $= |(P_5(z) - P_5(z, N_p))/P_5(z)|$ at non-zero $P_5(z, N_p)$, and
(c) Log[$P_5(z)$] (solid line) and Log[$P_5(z, 10^{[\ldots]})$] (diamonds). Only non-zero values
of $P_5(z, 10^{[\ldots]})$ are shown, and the dashed line corresponds to $z_0$ in (c).